Insurance fraud is a concern in many sectors, such as health care, homeowners, and automobile insurance. It is not only costly to insurers but also affects non-fraudulent policyholders.
This analysis will focus on fraud within the auto insurance industry in India. The data used for this project was downloaded from Kaggle. https://www.kaggle.com/
Our goal is to use classification models for predicting which auto insurance claims are fraudulent. Several classification models will be assessed on their ability to successfully predict actual fraud.
Readers may not be interested in all sections of this analysis. Sections of interest can be accessed directly through the table of contents on the left. For example, to go directly to the classification models, click on Models.
This project can be viewed with the accompanying code by following the below link.
The following programs were used for this project.
Python 3.10.10
R 4.2.2 (Specific Visualizations)
RStudio 2023.03.1+446 (For document output)
The data was downloaded as five individual data sets. We will review each data set for suitability of being merged into one data set.
## ************Train_Claim_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 19 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 DateOfIncident 28836 non-null object
## 2 TypeOfIncident 28836 non-null object
## 3 TypeOfCollission 28836 non-null object
## 4 SeverityOfIncident 28836 non-null object
## 5 AuthoritiesContacted 28836 non-null object
## 6 IncidentState 28836 non-null object
## 7 IncidentCity 28836 non-null object
## 8 IncidentAddress 28836 non-null object
## 9 IncidentTime 28836 non-null int32
## 10 NumberOfVehicles 28836 non-null int32
## 11 PropertyDamage 28836 non-null object
## 12 BodilyInjuries 28836 non-null int32
## 13 Witnesses 28836 non-null object
## 14 PoliceReport 28836 non-null object
## 15 AmountOfInjuryClaim 28836 non-null int32
## 16 AmountOfPropertyClaim 28836 non-null int32
## 17 AmountOfVehicleDamage 28836 non-null int32
## 18 AmountOfTotalClaim 28836 non-null int32
## dtypes: int32(7), object(12)
## memory usage: 3.4+ MB
## ************Train_Policy_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 InsurancePolicyNumber 28836 non-null int32
## 1 CustomerLoyaltyPeriod 28836 non-null int32
## 2 DateOfPolicyCoverage 28836 non-null object
## 3 InsurancePolicyState 28836 non-null object
## 4 Policy_CombinedSingleLimit 28836 non-null object
## 5 Policy_Deductible 28836 non-null int32
## 6 PolicyAnnualPremium 28836 non-null float64
## 7 UmbrellaLimit 28836 non-null int32
## 8 InsuredRelationship 28836 non-null object
## 9 CustomerID 28836 non-null object
## dtypes: float64(1), int32(4), object(5)
## memory usage: 1.8+ MB
## ************Train_Demographics_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 10 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 InsuredAge 28836 non-null int32
## 2 InsuredZipCode 28836 non-null int32
## 3 InsuredGender 28836 non-null object
## 4 InsuredEducationLevel 28836 non-null object
## 5 InsuredOccupation 28836 non-null object
## 6 InsuredHobbies 28836 non-null object
## 7 CapitalGains 28836 non-null int32
## 8 CapitalLoss 28836 non-null int32
## 9 Country 28836 non-null object
## dtypes: int32(4), object(6)
## memory usage: 1.8+ MB
## **********Traindata_with_Targeet_p Information**********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 ReportedFraud 28836 non-null object
## dtypes: object(2)
## memory usage: 450.7+ KB
## ************Train_Vehicle_p Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 115344 entries, 0 to 115343
## Data columns (total 3 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 115344 non-null object
## 1 VehicleAttribute 115344 non-null object
## 2 VehicleAttributeDetails 115344 non-null object
## dtypes: object(3)
## memory usage: 2.6+ MB
## *************Train_Vehicle_p First 25 Rows*************
## CustomerID VehicleAttribute VehicleAttributeDetails
## 0 Cust20179 VehicleID Vehicle8898
## 1 Cust21384 VehicleModel Malibu
## 2 Cust33335 VehicleMake Toyota
## 3 Cust27118 VehicleModel Neon
## 4 Cust13038 VehicleID Vehicle30212
## 5 Cust1801 VehicleID Vehicle24096
## 6 Cust30237 VehicleModel RAM
## 7 Cust21334 VehicleYOM 1996
## 8 Cust26634 VehicleYOM 1999
## 9 Cust20624 VehicleMake Chevrolet
## 10 Cust14947 VehicleID Vehicle15216
## 11 Cust21432 VehicleYOM 2002
## 12 Cust22845 VehicleYOM 2000
## 13 Cust9006 VehicleMake Accura
## 14 Cust30659 VehicleYOM 2003
## 15 Cust18447 VehicleMake Honda
## 16 Cust19144 VehicleID Vehicle29018
## 17 Cust26846 VehicleID Vehicle21867
## 18 Cust4801 VehicleYOM 1998
## 19 Cust18081 VehicleYOM 2013
## 20 Cust17021 VehicleMake BMW
## 21 Cust30660 VehicleYOM 2002
## 22 Cust22099 VehicleID Vehicle30877
## 23 Cust33560 VehicleYOM 2011
## 24 Cust17371 VehicleYOM 2001
The data sets train claim, train policy, train demographics, and train with target are ready to be merged into one data set.
Viewing the first twenty-five rows of the Train Vehicle data, we can see that the VehicleAttribute column has repeating values: each CustomerID is associated with four rows, one each for VehicleModel, VehicleMake, VehicleID, and VehicleYOM. The number of rows is 115344, which is four times the row count of the other data sets. This data set will have to be modified before it can be merged with the other data sets. Each level of VehicleAttribute should become an individual feature holding its corresponding value from the VehicleAttributeDetails feature. This will be accomplished by spreading out the VehicleAttribute feature so that each level becomes a feature, which creates a new data set that is shorter and wider.
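The long-to-wide step described above can be sketched with pandas `pivot`. This is a minimal illustration on toy data, assuming each CustomerID has exactly one row per attribute; the frame names `vehicle_long` and `vehicle_wide` are placeholders, not necessarily the names used in the project code.

```python
import pandas as pd

# Toy long-format data mimicking Train_Vehicle: one row per (customer, attribute).
vehicle_long = pd.DataFrame({
    "CustomerID": ["Cust1"] * 4 + ["Cust2"] * 4,
    "VehicleAttribute": ["VehicleID", "VehicleMake", "VehicleModel", "VehicleYOM"] * 2,
    "VehicleAttributeDetails": ["Vehicle100", "Audi", "A5", "2008",
                                "Vehicle200", "Toyota", "Corolla", "2015"],
})

# Spread each level of VehicleAttribute into its own column.
vehicle_wide = (
    vehicle_long
    .pivot(index="CustomerID", columns="VehicleAttribute",
           values="VehicleAttributeDetails")
    .reset_index()
)
vehicle_wide.columns.name = None  # drop the leftover "VehicleAttribute" axis label

print(vehicle_wide.shape)  # one row per customer, one column per attribute
```

`pivot` raises an error if a customer has two rows for the same attribute, which doubles as a check that the four-rows-per-customer assumption holds.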
## ************train_vehicle_wide Information************
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 28836 entries, 0 to 28835
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 VehicleID 28836 non-null object
## 2 VehicleMake 28836 non-null object
## 3 VehicleModel 28836 non-null object
## 4 VehicleYOM 28836 non-null object
## dtypes: object(5)
## memory usage: 1.1+ MB
## *************train_vehicle_wide first 50 rows*************
## VehicleAttribute CustomerID VehicleID VehicleMake VehicleModel VehicleYOM
## 0 Cust10000 Vehicle26917 Audi A5 2008
## 1 Cust10001 Vehicle15893 Audi A5 2006
## 2 Cust10002 Vehicle5152 Volkswagen Jetta 1999
## 3 Cust10003 Vehicle37363 Volkswagen Jetta 2003
## 4 Cust10004 Vehicle28633 Toyota CRV 2010
## 5 Cust10005 Vehicle26409 Toyota CRV 2011
## 6 Cust10006 Vehicle12114 Mercedes C300 2000
## 7 Cust10007 Vehicle26987 Suburu C300 2010
## 8 Cust10009 Vehicle12490 Volkswagen Passat 1995
## 9 Cust1001 Vehicle28516 Saab 92x 2004
## 10 Cust10011 Vehicle8940 Nissan Ultima 2002
## 11 Cust10012 Vehicle9379 Ford Fusion 2004
## 12 Cust10013 Vehicle22024 Accura Fusion 2001
## 13 Cust10014 Vehicle3601 Suburu Impreza 2011
## 14 Cust10016 Vehicle7515 Saab 92x 2005
## 15 Cust10017 Vehicle31838 Saab 92x 2005
## 16 Cust10018 Vehicle35954 Toyota 93 2000
## 17 Cust10019 Vehicle19647 Saab 93 2000
## 18 Cust10021 Vehicle37694 Volkswagen Passat 2006
## 19 Cust10022 Vehicle31889 Toyota Highlander 1997
## 20 Cust10023 Vehicle10464 Toyota Highlander 1999
## 21 Cust10024 Vehicle24452 Dodge X5 2001
## 22 Cust10025 Vehicle12734 Dodge X5 2002
## 23 Cust10026 Vehicle14492 Volkswagen Passat 2001
## 24 Cust10027 Vehicle38970 Saab Passat 1995
## 25 Cust10028 Vehicle3996 Honda Accord 2015
## 26 Cust10029 Vehicle12477 Toyota Corolla 2015
## 27 Cust10030 Vehicle34293 Ford Forrestor 2006
## 28 Cust10031 Vehicle33775 Suburu F150 2005
## 29 Cust10032 Vehicle34708 Nissan Pathfinder 2012
## 30 Cust10034 Vehicle26030 Saab 92x 2006
## 31 Cust10035 Vehicle3961 Saab Jetta 2007
## 32 Cust10037 Vehicle38667 Dodge Neon 2012
## 33 Cust1004 Vehicle17051 Chevrolet Tahoe 2014
## 34 Cust10040 Vehicle7284 Audi Wrangler 2007
## 35 Cust10041 Vehicle2119 Jeep A3 2008
## 36 Cust10042 Vehicle7459 Accura A5 1997
## 37 Cust10043 Vehicle6244 Accura RSX 2010
## 38 Cust10044 Vehicle38446 Chevrolet Malibu 1998
## 39 Cust10046 Vehicle3199 Audi A5 2011
## 40 Cust10047 Vehicle13780 Audi A5 2009
## 41 Cust10049 Vehicle35318 Ford F150 2008
## 42 Cust1005 Vehicle26158 Accura RSX 2009
## 43 Cust10051 Vehicle33864 Dodge E400 2014
## 44 Cust10052 Vehicle16314 Honda Legacy 2002
## 45 Cust10053 Vehicle35570 Suburu Legacy 2000
## 46 Cust10054 Vehicle13054 Audi Ultima 2006
## 47 Cust10057 Vehicle23410 Suburu Legacy 2005
## 48 Cust10058 Vehicle24044 BMW 92x 2005
## 49 Cust10059 Vehicle25575 BMW X5 2006
We have taken the data from train vehicle and created a new data set called train vehicle wide. This new data set has four new columns and 28836 rows which now matches the other four data sets. We are now ready to merge all data sets.
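A merge of this kind can be sketched as a chain of joins on CustomerID. The example below is an illustration on toy frames, assuming CustomerID is unique within each data set; the helper `merge_on_customer` and the toy values are hypothetical.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three of the data sets; rows need not be in the same order.
claim = pd.DataFrame({"CustomerID": ["Cust1", "Cust2"],
                      "AmountOfTotalClaim": [65501, 4200]})
policy = pd.DataFrame({"CustomerID": ["Cust2", "Cust1"],
                       "Policy_Deductible": [1000, 500]})
target = pd.DataFrame({"CustomerID": ["Cust1", "Cust2"],
                       "ReportedFraud": ["N", "Y"]})

def merge_on_customer(left, right):
    # validate="one_to_one" raises if CustomerID repeats in either frame.
    return left.merge(right, on="CustomerID", how="inner", validate="one_to_one")

# Fold the list of frames into a single merged frame.
fraud = reduce(merge_on_customer, [claim, policy, target])
print(fraud.shape)
```

An inner join keeps only customers present in every data set, so comparing the merged row count with the inputs is a quick sanity check.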
## *******************fraud Information*******************
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 42 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null object
## 1 DateOfIncident 28836 non-null object
## 2 TypeOfIncident 28836 non-null object
## 3 TypeOfCollission 28836 non-null object
## 4 SeverityOfIncident 28836 non-null object
## 5 AuthoritiesContacted 28836 non-null object
## 6 IncidentState 28836 non-null object
## 7 IncidentCity 28836 non-null object
## 8 IncidentAddress 28836 non-null object
## 9 IncidentTime 28836 non-null int32
## 10 NumberOfVehicles 28836 non-null int32
## 11 PropertyDamage 28836 non-null object
## 12 BodilyInjuries 28836 non-null int32
## 13 Witnesses 28836 non-null object
## 14 PoliceReport 28836 non-null object
## 15 AmountOfInjuryClaim 28836 non-null int32
## 16 AmountOfPropertyClaim 28836 non-null int32
## 17 AmountOfVehicleDamage 28836 non-null int32
## 18 AmountOfTotalClaim 28836 non-null int32
## 19 InsuredAge 28836 non-null int32
## 20 InsuredZipCode 28836 non-null int32
## 21 InsuredGender 28836 non-null object
## 22 InsuredEducationLevel 28836 non-null object
## 23 InsuredOccupation 28836 non-null object
## 24 InsuredHobbies 28836 non-null object
## 25 CapitalGains 28836 non-null int32
## 26 CapitalLoss 28836 non-null int32
## 27 Country 28836 non-null object
## 28 InsurancePolicyNumber 28836 non-null int32
## 29 CustomerLoyaltyPeriod 28836 non-null int32
## 30 DateOfPolicyCoverage 28836 non-null object
## 31 InsurancePolicyState 28836 non-null object
## 32 Policy_CombinedSingleLimit 28836 non-null object
## 33 Policy_Deductible 28836 non-null int32
## 34 PolicyAnnualPremium 28836 non-null float64
## 35 UmbrellaLimit 28836 non-null int32
## 36 InsuredRelationship 28836 non-null object
## 37 VehicleID 28836 non-null object
## 38 VehicleMake 28836 non-null object
## 39 VehicleModel 28836 non-null object
## 40 VehicleYOM 28836 non-null object
## 41 ReportedFraud 28836 non-null object
## dtypes: float64(1), int32(15), object(26)
## memory usage: 7.8+ MB
Feature engineering includes several steps.
First is feature creation. We create new variables from existing features which will help our model and data visualization.
Secondly, we can transform features from one representation to another. An example would be transforming a feature that is numerical to a type categorical.
Cleaning is the process of viewing the features and if something is not adding up with a feature, we can remove the values creating the problem or remove the feature entirely. An example is null values. We can replace a null value with another value, remove null values from the data set, or as mentioned before, remove the feature entirely.
Certain features are numeric yet may better serve our models as categorical. This can be assessed by checking the unique values of these features.
## ******** Unique Number of Vehicles********
## [3 1 4 2]
## ******** Unique Bodily Injuries********
## [1 2 0]
The above outputs indicate that both NumberOfVehicles and BodilyInjuries would be best as type categorical. We will create a function that converts numerical data types to categorical. Then the function will be applied to the selected numerical features.
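A conversion function of this kind might look like the following sketch. The name `to_category` and the toy data are assumptions for illustration only.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy of df with the listed columns cast to the category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

demo = pd.DataFrame({
    "NumberOfVehicles": [3, 1, 4, 2],
    "BodilyInjuries": [1, 2, 0, 1],
    "InsuredAge": [35, 41, 29, 52],
})
demo_cat = to_category(demo, ["NumberOfVehicles", "BodilyInjuries"])
print(demo_cat.dtypes)
```

Only the listed columns change type; InsuredAge stays numeric.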
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 2 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 NumberOfVehicles 28836 non-null category
## 1 BodilyInjuries 28836 non-null category
## dtypes: category(2)
## memory usage: 281.9 KB
Both features are now of type category.
## *************Incident Time Unique Values*************
## [17 10 22 7 20 18 3 5 14 16 15 13 12 9 19 4 11 1 8 0 6 21 23 2
## -5]
IncidentTime has unique values that would warrant it becoming categorical, though the many levels would not be optimal for use in our modeling. We can remedy this by placing unique time values into bins using a Python dictionary. This will reduce the number of levels.
## ***Incident Period Day Value Counts***
## night 7458
## early afternoon 5785
## early morning 5580
## late morning 3661
## late afternoon 3231
## evening 2699
## Name: IncidentPeriodDay, dtype: int64
We find from the value count output for the new feature IncidentPeriodDay that incident times have been placed into six unique periods of the day.
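The dictionary-based binning can be sketched as follows. The hour boundaries below are hypothetical, since the report does not state the actual cut-offs; only the six period names come from the output above.

```python
import pandas as pd

# Hypothetical hour ranges for each period (assumed, not taken from the report).
period_ranges = {
    "night": list(range(0, 5)) + list(range(21, 24)),
    "early morning": range(5, 9),
    "late morning": range(9, 12),
    "early afternoon": range(12, 15),
    "late afternoon": range(15, 18),
    "evening": range(18, 21),
}

# Invert to a flat lookup: hour -> period name.
hour_to_period = {hour: period
                  for period, hours in period_ranges.items()
                  for hour in hours}

incident_time = pd.Series([17, 10, 22, 3, -5], name="IncidentTime")
incident_period = incident_time.map(hour_to_period)
print(incident_period)
```

An hour with no dictionary entry, such as the -5 seen in the unique values, maps to a missing value rather than to a period.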
Date features used in creating new features are no longer required and will be removed from the data set.
For purposes of classification algorithms and visualizations we’ll need to convert all categorical columns (Object Data Type) to the category data type. This will be accomplished by creating a function to identify non-numerical columns and converting them to the category data type.
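One possible shape for such a function is sketched below. It treats every non-numeric column as categorical, which is an assumption; the function name `objects_to_category` is illustrative.

```python
import pandas as pd

def objects_to_category(df):
    """Return a copy of df with every non-numeric column cast to category."""
    non_numeric = df.select_dtypes(exclude="number").columns
    return df.astype({col: "category" for col in non_numeric})

demo = pd.DataFrame({
    "TypeOfIncident": ["Parked Car", "Vehicle Theft"],
    "InsuredGender": ["MALE", "FEMALE"],
    "InsuredAge": [35, 41],
})
demo_cat = objects_to_category(demo)
print(demo_cat.dtypes)
```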
## *****************fraud_v3 Information*****************
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28836 entries, 0 to 28835
## Data columns (total 42 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 CustomerID 28836 non-null category
## 1 TypeOfIncident 28836 non-null category
## 2 TypeOfCollission 28836 non-null category
## 3 SeverityOfIncident 28836 non-null category
## 4 AuthoritiesContacted 28836 non-null category
## 5 IncidentState 28836 non-null category
## 6 IncidentCity 28836 non-null category
## 7 IncidentAddress 28836 non-null category
## 8 NumberOfVehicles 28836 non-null category
## 9 PropertyDamage 28836 non-null category
## 10 BodilyInjuries 28836 non-null category
## 11 Witnesses 28836 non-null category
## 12 PoliceReport 28836 non-null category
## 13 AmountOfInjuryClaim 28836 non-null int32
## 14 AmountOfPropertyClaim 28836 non-null int32
## 15 AmountOfVehicleDamage 28836 non-null int32
## 16 AmountOfTotalClaim 28836 non-null int32
## 17 InsuredAge 28836 non-null int32
## 18 InsuredZipCode 28836 non-null int32
## 19 InsuredGender 28836 non-null category
## 20 InsuredEducationLevel 28836 non-null category
## 21 InsuredOccupation 28836 non-null category
## 22 InsuredHobbies 28836 non-null category
## 23 CapitalGains 28836 non-null int32
## 24 CapitalLoss 28836 non-null int32
## 25 Country 28836 non-null category
## 26 InsurancePolicyNumber 28836 non-null int32
## 27 CustomerLoyaltyPeriod 28836 non-null int32
## 28 InsurancePolicyState 28836 non-null category
## 29 Policy_CombinedSingleLimit 28836 non-null category
## 30 Policy_Deductible 28836 non-null int32
## 31 PolicyAnnualPremium 28836 non-null float64
## 32 UmbrellaLimit 28836 non-null int32
## 33 InsuredRelationship 28836 non-null category
## 34 VehicleID 28836 non-null category
## 35 VehicleMake 28836 non-null category
## 36 VehicleModel 28836 non-null category
## 37 VehicleYOM 28836 non-null category
## 38 ReportedFraud 28836 non-null category
## 39 coverageIncidentDiff 28836 non-null float64
## 40 dayOfWeek 28836 non-null category
## 41 IncidentPeriodDay 28414 non-null category
## dtypes: category(28), float64(2), int32(12)
## memory usage: 5.3 MB
From the above output we observe that all object data types are now type categorical.
Figure 1
Figure 2
We observe from the cross table that the ‘unknown’ type of collision is only associated with a small number of incident types related to collisions. These data points will be retained by renaming the “unknown” column to “none”.
Figure 3
Figure 4
From Figure 4 we detect certain features that must be dealt with due to missing values. First, the property damage feature will be dropped because many observations have no answer, which is denoted by a question mark.
Next, the category MISSINGVALUE from the Witnesses feature will be dropped.
Figure 5
Figure 6
Figure 6 informs us that there are additional categorical features which must be either cleaned or dropped. First, the feature Police Report has close to 10000 missing values (denoted by a question mark). This feature will be dropped.
The next feature requiring attention is InsuredGender. There are a small number of missing values, denoted by NA. This category will be removed from InsuredGender. The omission of this small count category will have no effect on our models.
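Removing a placeholder category from a categorical feature can be sketched as below: filter out the rows, then drop the now-empty level so it no longer appears in plots or encodings. The helper `drop_category` and the toy data are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "InsuredGender": pd.Categorical(["MALE", "FEMALE", "NA", "FEMALE"]),
    "AmountOfTotalClaim": [65501, 4200, 51000, 7300],
})

def drop_category(df, column, level):
    """Remove rows where `column` equals `level`, then drop the unused level."""
    out = df[df[column] != level].copy()
    out[column] = out[column].cat.remove_unused_categories()
    return out

cleaned = drop_category(demo, "InsuredGender", "NA")
print(cleaned["InsuredGender"].cat.categories)
```

The same pattern would apply to other placeholder levels mentioned in this section, such as MISSINGVALUE in Witnesses.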
Figure 7
## *******premium_missing shape*******
## (141, 40)
## *******fraud_v6 shape*******
## (28836, 40)
Figure 8
VehicleMake has a small number of missing values (denoted by ‘???’). The category ‘???’ will be removed from the feature.
Figure 9
Figure 9 displays the VehicleMake feature with no missing values.
Filtering for any PolicyAnnualPremium value that is equal to -1 we find 141 values returned. From the Attribute Information pdf provided with the data set we know that -1 represents a missing value. All observations with -1 will be removed.
From the size output we can observe all values of -1 have been removed.
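The filter-and-remove step can be sketched with a boolean mask. The toy frame below is illustrative; the names `premium_missing` and `fraud_clean` are assumptions chosen to echo the output labels.

```python
import pandas as pd

fraud = pd.DataFrame({
    "CustomerID": ["Cust1", "Cust2", "Cust3", "Cust4"],
    "PolicyAnnualPremium": [1266.4, -1.0, 1255.0, -1.0],
})

# -1 is the documented placeholder for a missing premium.
missing_mask = fraud["PolicyAnnualPremium"] == -1

premium_missing = fraud[missing_mask]   # rows to inspect or count
fraud_clean = fraud[~missing_mask]      # rows kept for analysis

print(premium_missing.shape, fraud_clean.shape)
```

In the report's data, removing the 141 flagged rows from 28836 leaves 28695, which matches the row count of the fraud_v8 shape shown later.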
Certain visualizations require numeric-only data. We’ll create a data set that contains only numeric data types.
## ******************Numeric Data Types******************
## AmountOfInjuryClaim int32
## AmountOfPropertyClaim int32
## AmountOfVehicleDamage int32
## AmountOfTotalClaim int32
## InsuredAge int32
## CapitalGains int32
## CapitalLoss int32
## CustomerLoyaltyPeriod int32
## Policy_Deductible int32
## PolicyAnnualPremium float64
## UmbrellaLimit int32
## coverageIncidentDiff float64
## dtype: object
The data set numeric_data only has features of numeric data types as seen from the above output.
Figure 10
There is very high to high correlation between Amount of Injury Claim, Amount of Property Claim, Amount of Vehicle Damage, and Amount of Total Claim. This is unsurprising as Amount of Total Claim is the sum of the other three. Amount of Total Claim is the only feature of the four that will be used for our machine learning models.
Other features exhibiting very high correlation are Loyalty period and Age. This makes sense as older customers have the chance to accrue loyalty time based on having lived longer than younger customers. Still, we will retain both features for our models.
Features not important for visualizing or building models will be dropped.
## **fraud_v8 shape**
## (28695, 33)
Figure 11
Based on the subplots from Figure 11, we observe that certain numeric features have outliers. We’ll take a closer look at those features.
Figure 13
The box plots from Figure 13 displaying Amount of Total Claim for the two events of ReportedFraud are interesting. For ReportedFraud=Y there are many outliers below 22000. Data points falling under 22000 for ReportedFraud=N are not considered outliers.
We’ll further check outliers by viewing the histograms from Figure 13. What stands out is that the distribution of AmountOfTotalClaim has two distinct peaks; that is, it is “bimodal”. The mode is the value that occurs most often in a data set, and each peak in a distribution marks a range of values that is locally the most common. AmountOfTotalClaim has two such ranges rather than one.
The second histogram from Figure 13 superimposes the two events. The superimposed histogram follows the same bimodal distribution as the single histogram. The outliers are of no concern and will not be removed from the data set.
Figure 14
The box plots from Figure 14 show outliers above the age of 60 for both reported fraud events. In addition, both histograms from Figure 14 have slight skews to the right. Looking closer at the subplots, this appears to be due to drivers over the age of 50. Drivers over the age of 50 or 60 seeking auto insurance coverage are not unusual. Since the outliers are not unusual, they will not be removed or transformed.
Figure 15
The box plots from Figure 15 both have outliers at the lower and higher ends. It’s difficult to determine from the first histogram whether there is a skew (tail). The mean of 1261 is slightly less than the median of 1266, which suggests a small skew to the left: a few small values of Policy Annual Premium are pulling the mean down. The third plot superimposes two histograms, one per Reported Fraud event. Reported Fraud=Y skews slightly to the left; its mean of 1255 is less than its median of 1271, which supports the left skew. The histogram for Reported Fraud=N appears roughly symmetric. Its mean and median are the same at 1263, which is consistent with a symmetric, approximately normal distribution, although equal mean and median alone do not prove normality.
Based on the statistical analysis, the skew of the first histogram is primarily caused by lower premiums of data points reported as fraud. This data can be important during model building. We’ll look at additional data to determine whether to keep these outliers.
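The mean-versus-median comparison used above can be sketched per fraud group as follows. The premium values are toy numbers chosen to show the pattern, not values from the data set.

```python
import pandas as pd

demo = pd.DataFrame({
    "ReportedFraud": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    "PolicyAnnualPremium": [700.0, 1250.0, 1300.0, 1350.0,
                            1100.0, 1200.0, 1300.0, 1400.0],
})

# Mean, median, and sample skewness of the premium within each group.
summary = (demo.groupby("ReportedFraud")["PolicyAnnualPremium"]
               .agg(["mean", "median", "skew"]))

# A negative difference (mean below median) points to a left skew.
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

In this toy example the single low premium in the Y group pulls its mean below its median, while the evenly spaced N group has equal mean and median.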
## **** Year Of Make****
## 2015 416
## 1995 531
## 1996 828
## 2014 871
## 1997 1131
## 2013 1256
## 1998 1276
## 2012 1308
## 2001 1428
## 1999 1479
## 2011 1518
## 2000 1523
## 2002 1527
## 2003 1571
## 2008 1622
## 2009 1623
## 2010 1631
## 2005 1635
## 2006 1637
## 2004 1661
## 2007 1709
## Name: VehicleYOM, dtype: int64
Figure 16
Auto insurance premiums are generally based on personal details like choice of coverage, type of vehicle driven, and age of car. The newer the car, typically the more expensive the insurance. This is based on the vehicle’s replacement cost. The year the car was manufactured plays just as big a part in the premium as the make and model itself. Figure 16 displays all years of vehicle make in our data set. We find that there are just over 6000 autos that are 15 years old or older relative to the latest year of 2015. The number of older vehicles explains the outliers at the lower end of the box plots from Figure 15. These outliers are not unusual and thus will not be removed or transformed.
Figure 17
The plots from Figure 17 are unusual. Both box plots have a median of zero. Reported Fraud=Y has a mean of 1,000,000 while the Reported Fraud=N mean is 918,000. Both histograms have their peak at zero and a long tail to the right.
There are only 7506 data points greater than zero: 2417 for Yes and 5089 for No. Data points greater than zero represent only 26 percent of the entire data set. Normally this would seem unusual, and we would review the raw data for errors. Checking the description of umbrella limit, we find that such extreme data points are not uncommon. Umbrella insurance provides “excess liability insurance” beyond the liability insurance already in auto insurance coverage. It’s for expensive situations where medical bills and/or repairs exceed those in “base” auto policies. Auto policy holders who fall in the higher income brackets are usually the purchasers of umbrella limit. Thus, for all data points, the mean of 972,000 and max of 10,000,000 are plausible values. Additionally, the median of zero is unsurprising as not many insured choose umbrella limits for their policies. Due to the small percentage of insured with umbrella limits we will exclude this feature from our models.
Figure 18
Figure 18 displays bar plots of categories belonging to the feature ‘severity of incident’ stacked based on whether fraud is ‘Y’ or ‘N’. ‘Major Damage’ stands out as 60% of claims are reported as fraud whereas the other categories have claims reported as fraud under 16%.
Figure 19
Figure 20
Figure 21
Figure 19 presents each vehicle make with percentages reported as fraud “Y” and “N”. We find that Volkswagen, Mercedes, Ford, BMW, and Audi are the vehicle makes with reported fraud over 30%. This is an interesting statistic though due to the large number of categories we’ll explore the ‘Vehicle Make’ feature further.
Box plots from Figure 20 show the median total claims is roughly the same for all models.
We find from Figure 21 that Nissan, Subaru, and Toyota have a median capital gain near 20,000, substantially larger than all other makes. The vehicle makes over 30% reported fraud from Figure 19 all have zero medians.
Due to the number of categories of “Vehicle Make” we will exclude it from the modeling process.
Figure 22
Figure 23
Figure 24
From Figure 22 we find there are four incident states in which the share of claims reported as fraud is over 30%. State3 has over 40% of claims reported as fraud.
Figure 23 and Figure 24 do not exhibit differences among incident states with regard to total claims and capital gains.
We’ll retain the ‘Incident State’ feature as it has half as many categories as ‘Vehicle Make’.
Figure 25
From Figure 25 two categories stand out with respect to reported fraud. ‘Single Vehicle Collision’ and ‘Multi-vehicle collision’ from the feature ‘Type of Incident’ have claims reported as fraud at 31% and 29% respectively. The other two categories are under 14%.
Before model building can start, we’ll need to perform pre-processing. This will entail splitting our data into training, validation, and test sets along with transforming numerical and categorical features into classification friendly formats.
Select features
## The Target categories: Index(['N', 'Y'], dtype='object'):
We will separate the data to get predictor features and target features into separate data frames.
The data type of the target feature is categorical. Most machine learning algorithms require numerical data types. The target feature y will be transformed to a numeric type.
## dtype('int64')
From the two above outputs we can see that the target feature has been converted into binary form though its data type is integer. The data type must be converted back to categorical.
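The two conversions of the target can be sketched as below: first to a 0/1 integer, then back to the category dtype. Treating 'Y' as 1 follows the report's later choice of fraud as the positive class; the exact method used in the project code is not shown, so this is one possible approach.

```python
import pandas as pd

y = pd.DataFrame({"ReportedFraud": pd.Categorical(["N", "Y", "N", "N", "Y"])})

# Step 1: encode as integers, with 'Y' (fraud) as the positive class 1.
y_binary = (y["ReportedFraud"] == "Y").astype("int64")
print(y_binary.dtype)

# Step 2: cast the 0/1 integers back to the category dtype.
y_encoded = y_binary.astype("category").to_frame()
print(y_encoded["ReportedFraud"].cat.categories)
```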
## Target Feature categories as binary: Int64Index([0, 1], dtype='int64'):
## Target Feature as categorical binary:
##
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 1 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 ReportedFraud 28181 non-null category
## dtypes: category(1)
## memory usage: 247.8 KB
## Shape of Predictor Features is (28181, 23):
## Shape of Target Feature is (28181, 1):
The makeup of the X data frame is 28181 rows and 23 columns. The y data frame has the same number of rows, 28181, and one column, the target feature.
## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 28181 entries, 0 to 28835
## Data columns (total 23 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 TypeOfIncident 28181 non-null category
## 1 TypeOfCollission 28181 non-null category
## 2 SeverityOfIncident 28181 non-null category
## 3 AuthoritiesContacted 28181 non-null category
## 4 IncidentState 28181 non-null category
## 5 NumberOfVehicles 28181 non-null category
## 6 BodilyInjuries 28181 non-null category
## 7 Witnesses 28181 non-null category
## 8 AmountOfTotalClaim 28181 non-null int32
## 9 InsuredAge 28181 non-null int32
## 10 InsuredGender 28181 non-null category
## 11 CapitalGains 28181 non-null int32
## 12 CapitalLoss 28181 non-null int32
## 13 CustomerLoyaltyPeriod 28181 non-null int32
## 14 InsurancePolicyState 28181 non-null category
## 15 Policy_CombinedSingleLimit 28181 non-null category
## 16 Policy_Deductible 28181 non-null int32
## 17 PolicyAnnualPremium 28181 non-null float64
## 18 UmbrellaLimit 28181 non-null int32
## 19 InsuredRelationship 28181 non-null category
## 20 coverageIncidentDiff 28181 non-null float64
## 21 dayOfWeek 28181 non-null category
## 22 IncidentPeriodDay 28181 non-null category
## dtypes: category(14), float64(2), int32(7)
## memory usage: 1.8 MB
We will now split X and y into train, validation, and test sets.
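A three-way split can be sketched with two calls to scikit-learn's `train_test_split`: hold out 30% of the rows, then divide that hold-out in half. The 70/15/15 proportions are inferred from the shapes in the output; stratifying on the target and the `random_state` value are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "AmountOfTotalClaim": rng.integers(100, 100_000, size=1000),
    "InsuredAge": rng.integers(19, 65, size=1000),
})
y = pd.DataFrame({"ReportedFraud": rng.integers(0, 2, size=1000)})

# Step 1: keep 70% for training, hold out 30%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Step 2: split the hold-out evenly into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(X_train.shape, X_valid.shape, X_test.shape)
```

Stratifying keeps the fraud rate similar across the three sets, which matters when the positive class is the minority.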
## Shape of X Train (19726, 23):
## Shape of X Valid (4227, 23):
## Shape of X Test (4228, 23):
## Shape of y Train (19726, 1):
## Shape of y Valid (28181, 1):
## Shape of y Test (4228, 1):
The y data frames will be transformed to NumPy arrays.
## Shape of y Train rv (19726, 1):
## Shape of y Valid rv (28181, 1):
## Shape of y Test rv (4228, 1):
Next, we will transform the y arrays to one-dimensional arrays.
## Shape of y Train rv (19726,):
## Shape of y Valid rv (4227,):
## Shape of y Test rv (4228,):
From the above output we see that y train, y valid, and y test have been transformed into one dimensional numpy arrays.
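The two-step conversion can be sketched as follows, using a toy single-column data frame in place of the real target.

```python
import pandas as pd

y_train = pd.DataFrame({"ReportedFraud": [0, 1, 0, 0]})

# Data frame -> two-dimensional NumPy array with one column.
y_train_2d = y_train.to_numpy()
print(y_train_2d.shape)

# Flatten to one dimension, the shape most scikit-learn estimators expect for y.
y_train_1d = y_train_2d.ravel()
print(y_train_1d.shape)
```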
Our next step is to transform the predictor features into acceptable machine learning formats.
Transformation for numerical features is performed by scaling. Scaling prevents a feature with a range in, say, the thousands from being considered more important than a feature having a lower range. Scaling places features at the same importance before being applied to a machine learning algorithm. There are different methods used in scaling features; for this analysis we’ll be using standard scaling. Standard scaling transforms the data to have zero mean and a variance of one, thus making the data unitless.
Most machine learning algorithms only accept numerical features which makes categorical features unacceptable in their original form. Thus, we need to encode categorical features into numerical values. The act of replacing categories with numbers is called categorical encoding. For this we will use one-hot encoding. Categorical features are represented as a group of binary features, where each binary feature represents one category. The binary feature takes the integer value 1 if the category is present, or 0 otherwise.
First, we will create transformed train, valid, and test sets for the Logistic Regression model. This entails dropping the first category of each feature during One Hot Encoding.
## ************First Five Rows X_train_lr************
## scale__AmountOfTotalClaim ... ohe__IncidentPeriodDay_night
## 10858 -1.839897 ... 0.0
## 10084 1.167314 ... 0.0
## 1826 -1.842925 ... 0.0
## 27377 1.142850 ... 0.0
## 19587 -1.822924 ... 0.0
##
## [5 rows x 63 columns]
## ************First Five Rows X_valid_lr************
## scale__AmountOfTotalClaim ... ohe__IncidentPeriodDay_night
## 16124 -0.628661 ... 1.0
## 17479 0.521933 ... 0.0
## 6203 -0.063166 ... 0.0
## 14876 0.921840 ... 0.0
## 493 0.549704 ... 0.0
##
## [5 rows x 63 columns]
## ************First Five Rows X_test_lr************
## scale__AmountOfTotalClaim ... ohe__IncidentPeriodDay_night
## 9540 0.812948 ... 0.0
## 17332 1.046629 ... 0.0
## 13547 -0.089223 ... 1.0
## 21248 0.672341 ... 1.0
## 10040 1.560886 ... 0.0
##
## [5 rows x 63 columns]
We see from the first five rows of the train, valid, and test sets that the features have been transformed while at the same time retaining the column feature names.
## Shape of X Train lr (19726, 63):
## Shape of X Valid lr (4227, 63):
## Shape of X Test lr (4228, 63):
Next, we transform training, valid, and test sets for all other models. During OneHotEncoding, the first category will be dropped only if the feature is binary.
Transform X_train
## ************First Five Rows X_train_tr************
## num__AmountOfTotalClaim ... cat__IncidentPeriodDay_night
## 10858 -1.839897 ... 0.0
## 10084 1.167314 ... 0.0
## 1826 -1.842925 ... 0.0
## 27377 1.142850 ... 0.0
## 19587 -1.822924 ... 0.0
##
## [5 rows x 76 columns]
## ************First Five Rows X_valid_tr************
## num__AmountOfTotalClaim ... cat__IncidentPeriodDay_night
## 16124 -0.628661 ... 1.0
## 17479 0.521933 ... 0.0
## 6203 -0.063166 ... 0.0
## 14876 0.921840 ... 0.0
## 493 0.549704 ... 0.0
##
## [5 rows x 76 columns]
## ************First Five Rows X_test_tr************
## num__AmountOfTotalClaim ... cat__IncidentPeriodDay_night
## 9540 0.812948 ... 0.0
## 17332 1.046629 ... 0.0
## 13547 -0.089223 ... 1.0
## 21248 0.672341 ... 1.0
## 10040 1.560886 ... 0.0
##
## [5 rows x 76 columns]
## Shape of X Train tr (19726, 76):
## Shape of X Valid tr (4227, 76):
## Shape of X Test tr (4228, 76):
From the shape output we find there are 13 additional columns compared to the logistic regression transformed data.
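The difference in width follows from the two dropping rules. The sketch below counts one-hot columns under each rule; the category counts are hypothetical, and the function name `encoded_width` is introduced here only for illustration.

```python
def encoded_width(category_counts, drop):
    """Count one-hot columns produced for a set of categorical features.

    category_counts maps feature name -> number of distinct categories.
    drop="first":     always drop one category per feature.
    drop="if_binary": drop one category only for two-category features.
    """
    total = 0
    for n_categories in category_counts.values():
        if drop == "first":
            total += n_categories - 1
        elif drop == "if_binary":
            total += 1 if n_categories == 2 else n_categories
        else:
            raise ValueError(f"unknown drop rule: {drop}")
    return total

# Hypothetical category counts, not the report's actual data.
counts = {"PropertyDamage": 2, "SeverityOfIncident": 4, "IncidentPeriodDay": 6}

width_lr = encoded_width(counts, "first")         # 1 + 3 + 5
width_other = encoded_width(counts, "if_binary")  # 1 + 4 + 6
extra_columns = width_other - width_lr            # one per non-binary feature
```

Under these rules every non-binary feature contributes exactly one extra column, so a difference of 13 columns would be consistent with 13 non-binary categorical features.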
For the purpose of evaluating model performance, the event of interest is a claim reported as fraudulent (reported fraud is yes). This is considered the positive class. Classification metrics are used to determine how well our models predict the event of interest. In the formulas below, TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
Accuracy - measures the number of predictions that are correct as a percentage of the total number of predictions that are made. As an example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: number of correct predictions / number of total predictions, or (TP+TN)/(TP+TN+FP+FN)
Precision - tells us about the quality of positive predictions. A high-precision model may not find all the positives, but the ones it does classify as positive are very likely to be correct. As an example, out of everyone predicted to have defaulted, how many of them actually did default? So within everything that has been predicted as positive, precision is the percentage that is correct. Calculation: true positives / all predicted positives, or TP/(TP+FP)
Recall - tells us how well the model identifies the actual positives. A high-recall model may find most of the positives, yet it may also wrongly flag many cases that are not actually positive. As an example, out of all the patients who have the disease, how many were correctly identified? So within everything that actually is positive, recall is the percentage the model successfully found. A model with low recall is not able to find all (or a large part) of the positive cases in the data. Calculation: true positives / (true positives + false negatives), or TP/(TP+FN)
F1 Score - The F1 score is defined as the harmonic mean of precision and recall.
The harmonic mean is an alternative to the more common arithmetic mean. It is often useful when computing an average rate. https://en.wikipedia.org/wiki/Harmonic_mean
The formula for the F1 score is the following: F1 = 2 * (Precision * Recall) / (Precision + Recall)
Since the F1 score is an average of precision and recall, it gives equal weight to both.
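The four metrics can be computed directly from confusion-matrix counts. The sketch below uses counts chosen to be consistent with the extreme gradient boost validation results discussed later in this report (912 true positives, 147 false positives, class supports of 3097 and 1130); the true negative and false negative counts are derived by subtraction and are an assumption of this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # quality of the positive predictions
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# tn = 3097 - 147 and fn = 1130 - 912 (derived, see lead-in).
metrics = classification_metrics(tp=912, tn=2950, fp=147, fn=218)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```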
LogisticRegression(C=1, max_iter=10000, solver='newton-cg')
The above display gives us the parameters chosen for the logistic regression model.
## ****Logistic RegressionValidation Classification Report****
## precision recall f1-score support
##
## 0 0.83 0.90 0.86 3097
## 1 0.64 0.50 0.56 1130
##
## accuracy 0.79 4227
## macro avg 0.73 0.70 0.71 4227
## weighted avg 0.78 0.79 0.78 4227
Figure 26
We will now look at feature importance. Feature importance is a score assigned to each feature of a machine learning model that indicates how “important” the feature is to the model’s predictions.
For the logistic regression model we took the absolute value of the coefficients so as to capture the importance of features with either a negative or a positive effect.
Now that we have the importance of the features, we will transform the coefficients for easier interpretation. The coefficients are in log-odds format. We will transform them to odds-ratio format.
## ******************Top Five Coefficients******************
## Feature Exp_Coefficient
## 50 ohe__InsuredRelationship_unmarried 1.634025
## 48 ohe__InsuredRelationship_other-relative 1.521623
## 34 ohe__Witnesses_2 1.503439
## 47 ohe__InsuredRelationship_not-in-family 1.467964
## 58 ohe__IncidentPeriodDay_early morning 1.353613
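The ranking by absolute value and the conversion from log odds to odds ratios can be sketched as follows. The coefficient values and feature names are hypothetical, not the report's fitted values.

```python
import math

# Hypothetical log-odds coefficients (not the report's fitted values).
coefficients = {
    "feature_a": 0.49,
    "feature_b": -0.80,
    "feature_c": 0.05,
}

# Importance: rank by absolute value so strong negative effects count too.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]),
                reverse=True)

# Odds ratio: exponentiate each log-odds coefficient.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
```

An odds ratio above 1 raises the odds of the positive class and one below 1 lowers them. For a one-hot feature, a value of about 1.63 reads as odds of fraud roughly 63% higher than for the dropped reference category, holding the other features fixed.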
Support Vector Machine (the “road machine”) finds the decision boundary that separates the classes while maximizing the margin, the distance between the boundary and the nearest data points of each class. A decision boundary differentiates two classes: data points falling on opposite sides of the boundary are assigned to different classes. For a binary problem the classes would be either yes or no.
SVC(C=1, gamma=0.1, random_state=0)
The above display gives us the parameters chosen for the support vector machine model.
## **********SVC Validation Classification Report**********
## precision recall f1-score support
##
## 0 0.92 0.98 0.95 3097
## 1 0.93 0.76 0.84 1130
##
## accuracy 0.92 4227
## macro avg 0.92 0.87 0.89 4227
## weighted avg 0.92 0.92 0.92 4227
Decision trees use a flowchart-like tree structure to show the predictions that result from a series of feature splits. To accomplish this, a decision tree is made up of three types of nodes:
Root Node (parent node): The node that starts the graph. It evaluates the variable that best splits the data.
Intermediate Nodes (child nodes): These are nodes where features are evaluated for further splits of the data but are not the final nodes.
Leaf nodes (terminal nodes): These are the final nodes of the tree, where the prediction of a categorical event is made.
For a more detailed explanation of decision trees check the link below.
Parameter Definitions:
min_samples_split - The minimum number of samples (or observations) required in a node for it to be considered for splitting.
min_samples_leaf - The minimum number of samples (or observations) required in a terminal node or leaf. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
max_depth - The maximum depth of the tree, which is the same as the maximum number of splitting rounds from the root to a leaf.
max_features - The number of features to consider when looking for the best split.
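To illustrate how the first three parameters gate a candidate split, here is a small sketch. The function `can_split` is a hypothetical helper written for this example, not part of any library.

```python
def can_split(n_node, n_left, n_right, depth,
              min_samples_split=2, min_samples_leaf=1, max_depth=None):
    """Return True if a candidate split satisfies the size and depth limits.

    n_node:  samples in the node being considered for splitting
    n_left:  samples that would go to the left branch
    n_right: samples that would go to the right branch
    depth:   depth of the node being split (the root is depth 0)
    """
    if max_depth is not None and depth >= max_depth:
        return False  # tree is already as deep as allowed
    if n_node < min_samples_split:
        return False  # node is too small to be split at all
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False  # one branch would end up too small
    return True

# With min_samples_leaf=5, a 6/4 split is rejected but a 5/5 split is allowed.
print(can_split(10, 6, 4, depth=0, min_samples_leaf=5))
print(can_split(10, 5, 5, depth=0, min_samples_leaf=5))
```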
DecisionTreeClassifier(max_depth=1000, max_features=0.5, min_samples_leaf=5)
The above display gives us the parameters chosen for the decision tree model.
## ****Decision Tree Classification Report****
## precision recall f1-score support
##
## 0 0.73 0.74 0.74 3097
## 1 0.27 0.26 0.26 1130
##
## accuracy 0.61 4227
## macro avg 0.50 0.50 0.50 4227
## weighted avg 0.61 0.61 0.61 4227
Figure 27
Figure 28
Random forest is an ensemble learning method. Ensemble learning takes predictions from multiple models and merges them to enhance the accuracy of prediction. There are four types of ensemble techniques. We’ll be using bagging (of which random forest is an example) and boosting, of which our next four models are examples.
Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions.
Random forest models are made up of individual decision trees whose predictions are combined for a final result. The final result is decided by majority rule, meaning the final prediction is the class that the majority of the decision trees chose. For example, if 3 of 5 trees predict ‘yes’ for the classification problem, the forest predicts ‘yes’. Random forests can be made up of thousands of decision trees.
Simply put, the random forest builds multiple decision trees and merges them together to get a more accurate prediction.
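The two ingredients described above, sampling the data with replacement for each tree and combining the trees' predictions by majority rule, can be sketched in a few lines. The votes and row indices are hypothetical, and the tree-fitting step itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                # stand-in for training row indices
sample = bootstrap_sample(rows, rng)  # some rows repeat, some are left out

# Hypothetical votes from five trees for a single claim.
tree_votes = ["yes", "no", "yes", "yes", "no"]
final = majority_vote(tree_votes)
```

Implementations differ in the details. scikit-learn's random forest, for instance, averages the trees' predicted class probabilities rather than counting hard votes.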
Parameter Definitions (Not previously defined):
n_estimators - the number of trees to construct
RandomForestClassifier(max_depth=1000, max_features=0.5, min_samples_leaf=3,
                       min_samples_split=6, n_estimators=5000, n_jobs=-1)
The above display presents the parameters chosen for the random forest classifier.
## ****Random Forest Validation Classification Report****
## precision recall f1-score support
##
## 0 0.90 0.97 0.93 3097
## 1 0.91 0.70 0.79 1130
##
## accuracy 0.90 4227
## macro avg 0.90 0.84 0.86 4227
## weighted avg 0.90 0.90 0.90 4227
Figure 29
Figure 30
Boosting learns from the mistakes of individual trees. Each new tree builds on the previous tree. We’ll be using four boosting algorithms, the first being AdaBoost.
In AdaBoost, each new tree adjusts its weights based on the errors of the previous tree. Every observation has an assigned weight, and the trees are built in an additive manner, assigning greater weights (more importance) to observations misclassified by the previous learners. A misclassified observation would be, for example, one predicted as yes when the actual value is no. The individual trees are weak learners; a weak learner performs only slightly better than a random guess.
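The reweighting step can be sketched with the classic binary AdaBoost update. This is a textbook formulation offered as an illustration; library implementations may use a variant, and the function name `adaboost_reweight` is an assumption of this example.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of the classic binary AdaBoost weight update.

    weights:       current observation weights (summing to 1)
    misclassified: True where the current tree got the observation wrong
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)  # this tree's say in the final vote
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated], alpha  # renormalize to sum to 1

# Five equally weighted observations; the tree misclassifies the first one.
weights = [0.2] * 5
new_weights, alpha = adaboost_reweight(weights, [True, False, False, False, False])
```

The misclassified observation ends up with a larger weight than the correctly classified ones, so the next tree concentrates on it.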
Parameter Definitions (Not previously defined)
Learning rate - Shrinks the contribution of individual trees in each round of boosting so that no tree has too much influence. With a lower learning rate, more trees are required to produce better scores. Lowering the learning rate helps prevent overfitting because the size of the weights carried forward is smaller.
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=1, n_estimators=5000, random_state=1)
The above display presents the parameters chosen for the AdaBoost model.
## ****AdaBoost Validation Classification Report****
## precision recall f1-score support
##
## 0 0.91 0.98 0.94 3097
## 1 0.92 0.72 0.81 1130
##
## accuracy 0.91 4227
## macro avg 0.91 0.85 0.87 4227
## weighted avg 0.91 0.91 0.90 4227
Figure 31
Figure 32
Gradient boosting also uses incorrect predictions from previous trees to adjust the next tree, though it does so by fitting each new tree to the errors of the previous tree’s predictions. The mistakes from the previous trees are used to build a new tree solely around these mistakes. As with AdaBoost, gradient boosting combines many weak learners into a strong learner. The difference is that the gradient boosting algorithm fits each new tree only to the errors of the previous trees, whereas AdaBoost reweights the observations.
The main idea behind this algorithm is to build models sequentially, with each subsequent model trying to reduce the errors of the previous model. Errors are reduced by building a new model on the errors, or residuals, of the previous model.
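The residual-fitting idea can be sketched with a toy example. To keep it short, the sketch boosts one-split trees (stumps) on a squared-error regression problem with made-up data; the classifiers in this report work on a classification loss instead, so treat this as an illustration of the mechanism only.

```python
from statistics import fmean

def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    preds = [fmean(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # The learning rate shrinks each tree's contribution.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return fmean([(y - p) ** 2 for y, p in zip(ys, preds)])

# Made-up data for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mse_start = mse(ys, boost(xs, ys, n_rounds=0, learning_rate=0.5))
mse_one = mse(ys, boost(xs, ys, n_rounds=1, learning_rate=0.5))
mse_ten = mse(ys, boost(xs, ys, n_rounds=10, learning_rate=0.5))
```

Each added round lowers the training error, since every new stump is fitted to what the previous rounds still got wrong.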
Parameter Definitions (Not previously defined):
Criterion - The loss function used to find the optimal feature and threshold to split the data.
base learner - The initial decision tree. It’s the first learner in the process.
Subsample - A subset of samples. A subset of rows means not all rows may be included when building each tree. The percentage of rows used in each boosting round is limited.
max_leaf_nodes - The maximum number of terminal nodes or leaves in a tree.
n_iter_no_change - Used to decide if early stopping will be used to terminate training when the validation score is not improving.
tol - Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
ccp_alpha - Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen.
GradientBoostingClassifier(max_depth=15, max_features=9, min_samples_leaf=60,
                           min_samples_split=1000, n_estimators=4000,
                           subsample=0.7, warm_start=True)
The above display presents the parameters chosen for the gradient boost model.
## ****Gradient Boosting Classification Report****
## precision recall f1-score support
##
## 0 0.92 0.97 0.94 3097
## 1 0.89 0.75 0.82 1130
##
## accuracy 0.91 4227
## macro avg 0.90 0.86 0.88 4227
## weighted avg 0.91 0.91 0.91 4227
Figure 33
Figure 34
Extreme gradient boosting (XGBoost) is similar to gradient boosting with a few improvements. First, enhancements make it faster than other ensemble methods. Second, built-in regularization gives it an advantage in accuracy. Regularization is the process of adding information to reduce variance and prevent overfitting.
Parameter Definitions (Not previously defined):
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.3, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=5, missing=nan, monotone_constraints=None,
              n_estimators=5000, n_jobs=-1, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
The above display presents the parameters chosen for the XGBoost model.
## ****XGBoost Validation Report Classification Report****
## precision recall f1-score support
##
## 0 0.93 0.95 0.94 3097
## 1 0.86 0.81 0.83 1130
##
## accuracy 0.91 4227
## macro avg 0.90 0.88 0.89 4227
## weighted avg 0.91 0.91 0.91 4227
Figure 35
Figure 36
Light GBM grows trees vertically while other algorithms grow trees horizontally. That is, Light GBM grows trees leaf-wise, while other algorithms grow them level-wise. It chooses the leaf with the maximum delta (change in) loss to grow.
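The contrast between the two growth strategies can be sketched on a toy tree. The node names and loss reductions ("gains") below are hypothetical, and the code only orders the splits; it does not fit anything.

```python
import heapq
from collections import deque

# Hypothetical loss reduction obtained by splitting each splittable node.
GAIN = {"root": 10.0, "L": 6.0, "R": 1.0, "LL": 4.0}
CHILDREN = {"root": ("L", "R"), "L": ("LL", "LR"),
            "R": ("RL", "RR"), "LL": ("LLL", "LLR")}

def leaf_wise_order(n_splits):
    """Always split the current leaf with the largest loss reduction."""
    heap = [(-GAIN["root"], "root")]  # max-heap via negated gains
    order = []
    while heap and len(order) < n_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in CHILDREN[node]:
            if child in GAIN:  # only splittable children are candidates
                heapq.heappush(heap, (-GAIN[child], child))
    return order

def level_wise_order(n_splits):
    """Split nodes one level at a time, left to right."""
    queue = deque(["root"])
    order = []
    while queue and len(order) < n_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend(c for c in CHILDREN[node] if c in GAIN)
    return order

print(leaf_wise_order(3))   # goes deeper where the gain is largest
print(level_wise_order(3))  # finishes a level before going deeper
```

With three splits, the leaf-wise strategy spends its third split on the deeper node with the larger gain, while the level-wise strategy completes the second level first.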
parameter definition (not previously defined)
LGBMClassifier(colsample_bytree=0.7, learning_rate=0.01, max_depth=6,
               metric='None', min_child_samples=1, n_estimators=5000,
               random_state=314, subsample=0.8)
The above display presents the parameters chosen for the light gradient boost model.
## ****Light Gradient Boosting Classification Report****
## precision recall f1-score support
##
## 0 0.91 0.98 0.94 3097
## 1 0.93 0.74 0.82 1130
##
## accuracy 0.92 4227
## macro avg 0.92 0.86 0.88 4227
## weighted avg 0.92 0.92 0.91 4227
Figure 37
Figure 39
Figure 40
From the F1 scores above we find that the logistic regression (0.56) and decision tree (0.26) models are weak at predicting the event of interest. The other models have F1 scores between 0.79 and 0.84, which suggests a much stronger capability in predicting the event of interest.
We’ll look deeper into the scores by evaluating precision and recall. Precision is the metric we’ll focus on, because the higher the score, the fewer false positives are predicted. We do not wish to inaccurately accuse a policy holder of fraud when there is none. On the other hand, a higher recall means fewer false negatives (predicting no when the actual is yes), which we are not concerned with for this analysis.
The extreme gradient boost model has the best balance between precision and recall. It’s the only model with a recall score above 0.80. The higher recall comes at a cost, as its precision score is slightly lower than those of the support vector machine and the other ensemble models. We can visualize this by viewing the confusion matrix for each model. In Figure 35 we see that the extreme gradient boost model classified 912 insurance claims (lower right box) as fraudulent that were fraudulent. These are true positives. The model with the next highest number of correctly classified fraudulent claims is gradient boost with 852 (Figure 33). On the other hand, the extreme gradient boost model classified 147 claims as fraudulent (upper right box) that were not fraudulent. These are false positives. Random forest (Figure 29), gradient boost (Figure 33), AdaBoost (Figure 31), and light gradient boost (Figure 37) all have fewer false positives, at 79, 104, 72, and 67 respectively.
Interestingly, the extreme gradient boost model’s important features differ from those of the other ensemble models. While all have one significantly important feature, extreme gradient boost has no other feature with an importance over three percent (Figure 36). This differs even from gradient boost (Figure 34), which is essentially the same algorithm. This may play a part in the balance of recall and precision scores.
The support vector machine and the remaining ensemble models have precision scores of 0.89 or higher. These models prove very good at predicting the event of interest with few false positives. A precision score of 0.90 means that only 1 out of 10 claims predicted as fraudulent is not actually fraudulent.
We are now going to take the models with a precision score above 0.80 and apply them to unseen data (the test set).
## ******SVC Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.92 0.98 0.95 3053
## 1 0.93 0.79 0.86 1175
##
## accuracy 0.93 4228
## macro avg 0.93 0.88 0.90 4228
## weighted avg 0.93 0.93 0.92 4228
## ******Random Forest Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 3053
## 1 0.90 0.74 0.81 1175
##
## accuracy 0.90 4228
## macro avg 0.90 0.85 0.87 4228
## weighted avg 0.90 0.90 0.90 4228
## ******AdaBoost Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 3053
## 1 0.90 0.74 0.81 1175
##
## accuracy 0.91 4228
## macro avg 0.90 0.86 0.88 4228
## weighted avg 0.90 0.91 0.90 4228
## ******Gradient Boost Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.92 0.96 0.94 3053
## 1 0.87 0.78 0.82 1175
##
## accuracy 0.91 4228
## macro avg 0.89 0.87 0.88 4228
## weighted avg 0.90 0.91 0.90 4228
## ******Extreme Gradient Boost Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.93 0.94 0.94 3053
## 1 0.85 0.83 0.84 1175
##
## accuracy 0.91 4228
## macro avg 0.89 0.89 0.89 4228
## weighted avg 0.91 0.91 0.91 4228
## ******Light Gradient Boost Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.92 0.97 0.94 3053
## 1 0.91 0.77 0.83 1175
##
## accuracy 0.91 4228
## macro avg 0.91 0.87 0.89 4228
## weighted avg 0.91 0.91 0.91 4228
Figure 41
The support vector machine test precision score remained the same as its validation precision score. The ensemble models had small decreases in their test precision scores compared to their validation scores. This indicates possible slight overfitting.
We will build models using the algorithms from the previous models with precision scores above 0.88 and evaluate them with cross validation.
## *****X_cv Shape*****
## (28181, 23)
## *****y_cv Shape*****
## (28181, 1)
## *****X_train_cv Shape*****
## (19726, 23)
## *****X_test_cv Shape*****
## (8455, 23)
## *****y_train_cv Shape*****
## (19726, 1)
## *****y_test_cv Shape*****
## (8455, 1)
## *****y_train_cv Type*****
## <class 'pandas.core.frame.DataFrame'>
## *****X_test_cv Type*****
## <class 'pandas.core.frame.DataFrame'>
## *****y_train_cv_np Shape*****
## (19726, 1)
## *****y_test_cv_np Shape*****
## (8455, 1)
## *****y_train_cv_rv Shape*****
## (19726,)
## *****y_test_cv_rv Shape*****
## (8455,)
## *****X_train_cv_tr Shape*****
## (19726, 76)
## *****X_test_cv_tr Shape*****
## (8455, 76)
For the next models we will use cross validation on our training set.
Cross validation partitions the training set into equal subsets, called folds. The folds are used to assess a model’s performance on the training data.
The process works by setting aside the first fold as a test set while the remaining folds are used as the aggregated training set. The model is trained on the aggregated training set, then its performance is evaluated on the held-out fold. This continues until every fold has been held out as a test set. An evaluation metric is calculated for each iteration and the results are averaged together. This results in a cross validated metric.
This allows us to evaluate our model on different test sets without having to expose the model to the actual test set.
Stratified folds include the same percentage of target values in each fold. For this analysis we use five stratified folds in our cross validation.
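The fold construction and the hold-one-fold-out loop can be sketched without any library. The labels below are made up, the function names are introduced only for this example, and `score_fold` stands in for whatever training and scoring step is used; no shuffling is done here, which a real setup might add.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold keeps the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # deal out round-robin
    return folds

def cross_validate(labels, n_folds, score_fold):
    """Hold out each fold once, score it, and average the scores."""
    folds = stratified_folds(labels, n_folds)
    scores = []
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        scores.append(score_fold(train, held_out))
    return sum(scores) / len(scores)

# Made-up target: 15 non-fraud (0) and 5 fraud (1) claims.
labels = [0] * 15 + [1] * 5
folds = stratified_folds(labels, n_folds=5)

# Stand-in scorer: share of positives in the held-out fold.
positive_share = cross_validate(
    labels, 5, lambda train, test: sum(labels[i] for i in test) / len(test))
```

Each of the five folds contains four claims with exactly one fraudulent claim, mirroring the 25% positive share of the full made-up set.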
SVC(C=1, gamma=0.1, random_state=0)
RandomForestClassifier(max_depth=1000, max_features=0.25, min_samples_leaf=3,
                       min_samples_split=4, n_estimators=5000, n_jobs=-1)
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=1, n_estimators=10000, random_state=1)
GradientBoostingClassifier(learning_rate=0.075, max_depth=7, max_features=11,
                           min_samples_leaf=50, min_samples_split=1000,
                           n_estimators=1000, subsample=0.75, warm_start=True)
LGBMClassifier(colsample_bytree=0.9, learning_rate=0.01, max_depth=6,
               metric='None', min_child_samples=10, n_estimators=7500,
               random_state=314, subsample=0.7)
Figure 42
## ******SVC Final Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.92 0.98 0.95 6164
## 1 0.92 0.78 0.85 2291
##
## accuracy 0.92 8455
## macro avg 0.92 0.88 0.90 8455
## weighted avg 0.92 0.92 0.92 8455
## *******Random Forest Final Test Set Classification Report********
## precision recall f1-score support
##
## 0 0.90 0.97 0.93 6164
## 1 0.90 0.71 0.80 2291
##
## accuracy 0.90 8455
## macro avg 0.90 0.84 0.87 8455
## weighted avg 0.90 0.90 0.90 8455
## *******AdaBoost Final Test Set Classification Report********
## precision recall f1-score support
##
## 0 0.91 0.97 0.94 6164
## 1 0.90 0.73 0.81 2291
##
## accuracy 0.91 8455
## macro avg 0.90 0.85 0.87 8455
## weighted avg 0.91 0.91 0.90 8455
## *******Gradient Boost Final Test Set Classification Report********
## precision recall f1-score support
##
## 0 0.92 0.97 0.94 6164
## 1 0.90 0.77 0.83 2291
##
## accuracy 0.91 8455
## macro avg 0.91 0.87 0.88 8455
## weighted avg 0.91 0.91 0.91 8455
## *****Light Gradient Boost Final Test Set Classification Report******
## precision recall f1-score support
##
## 0 0.92 0.97 0.94 6164
## 1 0.90 0.77 0.83 2291
##
## accuracy 0.91 8455
## macro avg 0.91 0.87 0.89 8455
## weighted avg 0.91 0.91 0.91 8455
Figure 43
Figure 44
Figure 45
Figure 46
Figure 47
There was no change between the cross validation and test precision scores for the support vector machine and AdaBoost models (Figures 42 and 43). The random forest, gradient boost, and light gradient boost models improved by one percentage point in test set precision compared to the cross validation set (Figures 42 and 43). Based on the cross validation and test precision scores, we can conclude there is no overfitting in our final models.
Our final models proved very good at classifying fraudulent claims. In addition, all models kept the misclassification of non-fraudulent claims as fraudulent low: only about one out of ten claims predicted as fraudulent was incorrectly classified. For predicting on outside unseen data, any of our final models will perform well. The support vector machine proved best at classifying fraudulent claims, albeit only by a two percentage point margin over the other models.
For interpretability, the ensemble models may be a better choice. Each has common features among its top five most important. However, the ensemble models are split when comparing the effect of the most important feature. Random forest (Figure 44) and gradient boost (Figure 46) have the categorical feature “Severity of Incident-Major Damage” as the most important feature. This feature’s importance is twice that of the next closest feature. In comparison, the top five important features of AdaBoost (Figure 45) and light gradient boost (Figure 47) are all numerical. Additionally, these models’ importances are grouped closer together, such that no one feature dominates the classification of events. These differences may not be important, as all models perform well in the task at hand.